An Amazon SageMaker Container for Hugging Face Inference on AWS Graviton

Happy to share my new GitHub project: “An Amazon SageMaker Container for Hugging Face Inference on AWS Graviton”.

✅ Based on a clean source build of llama.cpp
✅ Native integration with the SageMaker SDK and with Graviton3/Graviton4 instances
✅ Model deployment from the Hugging Face hub or an Amazon S3 bucket
✅ Deployment of existing GGUF models
✅ Deployment of safetensors models, with automatic GGUF conversion and quantization
✅ Support for OpenAI API
✅ Support for streaming and non-streaming text generation
✅ Support for all llama-server flags

I’ve been trying for a while to collaborate with AWS and Arm on this project, and I got tired of waiting 😴

So there. Enjoy! Please give it a ⭐️ if you find it useful.

GitHub repository: https://github.com/juliensimon/sagemaker-inference-container-graviton